Add new pre-trained models BERTweet and PhoBERT #6129

Merged
merged 34 commits into huggingface:master on Sep 18, 2020

Conversation

datquocnguyen
Contributor

@datquocnguyen datquocnguyen commented Jul 29, 2020

I'd like to add pre-trained BERTweet and PhoBERT models to the transformers library.

Users can now use these models directly from transformers, e.g.:

bertweettokenizer = BertweetTokenizer.from_pretrained("vinai/bertweet-base")
bertweetmodel = BertweetModel.from_pretrained("vinai/bertweet-base")

phoberttokenizer = PhobertTokenizer.from_pretrained("vinai/phobert-large")
phobertmodel = PhobertModel.from_pretrained("vinai/phobert-large")
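
A minimal end-to-end sketch of using one of these pairs (hypothetical snippet; the example sentence is illustrative, and PhoBERT expects word-segmented Vietnamese input):

import torch

# Illustrative only: encode a word-segmented Vietnamese sentence and run the model.
input_ids = torch.tensor([phoberttokenizer.encode("Tôi là sinh_viên trường đại_học Công_nghệ .")])
with torch.no_grad():
    features = phobertmodel(input_ids)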

BERTweet: A pre-trained language model for English Tweets
PhoBERT: Pre-trained language models for Vietnamese

@julien-c julien-c added the model card Related to pretrained model cards label Jul 29, 2020
Re-add `bart` to LM_MAPPING
Re-add `from .configuration_mobilebert import MobileBertConfig`
not sure why it's replaced by `from transformers.configuration_mobilebert import MobileBertConfig`
@datquocnguyen datquocnguyen changed the title from "Add BERTweet and PhoBERT models" to "Add new pre-trained models BERTweet and PhoBERT" on Jul 29, 2020
datquocnguyen and others added 3 commits July 30, 2020 09:10
Remove BertweetTokenizer and PhobertTokenizer from tokenization_auto.py (they are currently not supported by AutoTokenizer).
@datquocnguyen
Contributor Author

datquocnguyen commented Jul 30, 2020

I'd like to add pre-trained BERTweet and PhoBERT models to the transformers library.

Users can now use these models directly from transformers, e.g.:

bertweettokenizer = BertweetTokenizer.from_pretrained("vinai/bertweet-base")
bertweetmodel = BertweetModel.from_pretrained("vinai/bertweet-base")

phoberttokenizer = PhobertTokenizer.from_pretrained("vinai/phobert-large")
phobertmodel = PhobertModel.from_pretrained("vinai/phobert-large")

BERTweet: A pre-trained language model for English Tweets
PhoBERT: Pre-trained language models for Vietnamese

Could I get any support from Hugging Face w.r.t. this pull request, @julien-c? Thanks.

@LysandreJik LysandreJik self-requested a review July 31, 2020 08:53
@LysandreJik
Member

Hello @datquocnguyen! As you've said, BERTweet and PhoBERT reimplement the RoBERTa model without adding any special behavior. I don't think it's necessary to reimplement them then, is it? Uploading them to the hub should be enough to load them into RoBERTa architectures, right?

@datquocnguyen
Contributor Author

Hi @LysandreJik
They use different tokenizers (i.e. fastBPE), so their tokenizers cannot be loaded with the RoBERTa tokenizer.
Please see a loading example using RoBERTa: https://github.com/VinAIResearch/BERTweet#transformers
An issue related to this is at: #5965

@datquocnguyen
Contributor Author

datquocnguyen commented Jul 31, 2020

I hope both BERTweet and PhoBERT can be incorporated into transformers in a similar manner to their counterparts (e.g. CamemBERT and FlauBERT). @LysandreJik Please let me know what I can do for this. Thanks.

@LysandreJik LysandreJik self-assigned this Jul 31, 2020
@LysandreJik
Member

Yes, I understand, that makes sense. There shouldn't be any issue in incorporating them into transformers.

@LysandreJik
Member

I've taken a quick look at it, and it looks very cool! Something we could maybe do better concerns the tokenizers:

  • They're currently untested, but they're the main contribution of this PR so they definitely should be tested.
  • If possible, we would prefer not to add an additional dependency (in this case fastBPE). It would be great to leverage the existing huggingface/tokenizers library.
  • On that front, given it's a BPE tokenizer, it should be easy enough to leverage the OpenAI GPT (not GPT-2) tokenizer, which seems very similar. It might even be possible to load the vocab/merge files directly in OpenAIGPTTokenizer.
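
For instance, a rough sketch of that last idea (purely hypothetical; it assumes the fastBPE vocab and merges files have first been converted into the vocab.json / merges.txt formats that OpenAIGPTTokenizer expects):

from transformers import OpenAIGPTTokenizer

# Hypothetical: the file names are placeholders for converted fastBPE artifacts.
tokenizer = OpenAIGPTTokenizer(vocab_file="vocab.json", merges_file="merges.txt")
print(tokenizer.tokenize("a sample tweet to sanity-check the BPE merges"))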

Let me know what you think!

@LysandreJik
Member

I haven't tried it directly, but as discussed with @n1t0, since you're not doing any fancy pre-processing it might be as simple as the following:

from tokenizers import CharBPETokenizer
from transformers import PreTrainedTokenizerFast

class PhobertTokenizerFast(PreTrainedTokenizerFast):
    # VOCAB_FILES_NAMES etc. are assumed to be defined elsewhere in the module.
    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
    model_input_names = ["attention_mask"]

    def __init__(self, vocab_file, merges_file, unk_token="<unk>", **kwargs):
        kwargs.setdefault("unk_token", unk_token)
        # Wrap a whitespace-split, case-preserving character-level BPE tokenizer.
        super().__init__(
            CharBPETokenizer(
                vocab_file=vocab_file, merges_file=merges_file, unk_token=unk_token,
                lowercase=False, bert_normalizer=False, split_on_whitespace_only=True,
            ),
            **kwargs,
        )
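
A hypothetical usage of the sketch above (file names are placeholders; the fastBPE files would first need converting to the vocab.json / merges.txt formats that CharBPETokenizer reads):

fast_tokenizer = PhobertTokenizerFast(vocab_file="vocab.json", merges_file="merges.txt")
print(fast_tokenizer.tokenize("Tôi là sinh_viên"))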

@datquocnguyen
Contributor Author

Thanks very much @LysandreJik. I will revise the code following your comments and let you know as soon as it's done.

@JetRunner
Contributor

JetRunner commented Aug 1, 2020

@datquocnguyen Yeah, these models are cool. Lovin' it. I think we can try to figure out how to convert the fastBPE format to a compatible format before adding it directly to our dependencies (I believe XLM uses fastBPE), so would you hold on a little while we figure it out? We have to be cautious when adding dependencies! Thanks!
cc @LysandreJik

@datquocnguyen
Contributor Author

Yes. Thanks @JetRunner

@justinphan3110

Some tokenizer functions (decode, convert_ids_to_tokens) haven't been implemented for PhobertTokenizer yet, right?

@Miopas

Miopas commented Aug 10, 2020

@datquocnguyen Thank you for this pull request. I tried the BERTweet model and ran into a problem: the tokenizer did not encode special symbols like "<pad>" as a whole token. Instead, it would split the string into characters like "< p a d >". I fixed the problem by modifying the code in tokenization_bertweet.py as below:

--- a/BERTweet/transformers/tokenization_bertweet.py
+++ b/BERTweet/transformers/tokenization_bertweet.py
@@ -242,9 +242,14 @@ class BertweetTokenizer(PreTrainedTokenizer):
             text = self.normalizeTweet(text)
         return self.bpe.apply([text])[0].split()

-    def convert_tokens_to_ids(self, tokens):
-        """ Converts a list of str tokens into a list of ids using the vocab."""
-        return self.vocab.encode_line(" ".join(tokens), append_eos=False, add_if_not_exist=False).long().tolist()
+    def _convert_token_to_id(self, token):
+        #""" Converts a list of str tokens into a list of ids using the vocab."""
+        #return self.vocab.encode_line(" ".join(tokens), append_eos=False, add_if_not_exist=False).long().tolist()
+        return self.vocab.encode_line(token, append_eos=False, add_if_not_exist=False).long().tolist()[0]
+
+    @property
+    def vocab_size(self) -> int:
+        return len(self.vocab)

From my understanding, to encode a sentence the interfaces are called in this order: PreTrainedTokenizerBase::encode
-> PreTrainedTokenizer::_encode_plus
-> PreTrainedTokenizer::convert_tokens_to_ids
-> PreTrainedTokenizer::_convert_token_to_id_with_added_voc
-> BertweetTokenizer::_convert_token_to_id for non-special tokens, or PreTrainedTokenizer::added_tokens_encoder for special tokens.
So BertweetTokenizer should implement the interface _convert_token_to_id rather than override convert_tokens_to_ids.
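
A quick, hypothetical way to check the fix (assumes the patched BertweetTokenizer from this PR is installed):

from transformers import BertweetTokenizer

tokenizer = BertweetTokenizer.from_pretrained("vinai/bertweet-base")
# With _convert_token_to_id in place, a special token should map to a single id
# instead of being split into per-character pieces.
print(tokenizer.convert_tokens_to_ids(["<pad>"]))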

@datquocnguyen
Contributor Author

I will have a look soon. Thanks @Miopas.

@SergioBarretoJr

I have just tried "BertweetTokenizer" and got this error:

"ImportError: cannot import name 'BertweetTokenizer' from 'transformers' (/home/apps/anaconda3/lib/python3.7/site-packages/transformers/init.py)"

Is there any solution to it?

I have also tried:

tokenizer2 = BertTokenizer.from_pretrained("vinai/bertweet-base")
trained = tokenizer2.encode("oops!! pelosi & dems admit numbers submitted to cbo are false! someurl #tcot #tlot #sgp #hcr #p2")

and got:
trained = [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]

Is there any solution to it?

Thanks!

@codecov

codecov bot commented Sep 17, 2020

Codecov Report

Merging #6129 into master will decrease coverage by 0.24%.
The diff coverage is 71.14%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6129      +/-   ##
==========================================
- Coverage   80.32%   80.08%   -0.25%     
==========================================
  Files         168      170       +2     
  Lines       32285    32642     +357     
==========================================
+ Hits        25932    26140     +208     
- Misses       6353     6502     +149     
Impacted Files Coverage Δ
src/transformers/tokenization_bertweet.py 63.18% <63.18%> (ø)
src/transformers/tokenization_phobert.py 83.45% <83.45%> (ø)
src/transformers/__init__.py 99.34% <100.00%> (+<0.01%) ⬆️
src/transformers/tokenization_auto.py 92.06% <100.00%> (+0.26%) ⬆️
src/transformers/modeling_tf_t5.py 26.05% <0.00%> (-63.52%) ⬇️
src/transformers/modeling_tf_gpt2.py 71.84% <0.00%> (-23.17%) ⬇️
src/transformers/modeling_lxmert.py 70.01% <0.00%> (-20.75%) ⬇️
src/transformers/modeling_transfo_xl_utilities.py 52.98% <0.00%> (-13.44%) ⬇️
src/transformers/modeling_transfo_xl.py 67.10% <0.00%> (-12.67%) ⬇️
src/transformers/tokenization_roberta.py 87.67% <0.00%> (-10.96%) ⬇️
... and 21 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b0cbcdb...257b9f1. Read the comment docs.

@napsternxg

@datquocnguyen can you also upload your model files to https://huggingface.co/vinai/bertweet-base?

I still get this error:

⚠️ Model name 'vinai/bertweet-base' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed 'vinai/bertweet-base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.

@napsternxg

napsternxg commented Sep 17, 2020

@datquocnguyen I looked at the PR and am looking forward to this merge. I have a few suggestions:

  1. I find the Phobert and Bertweet models to be quite similar. This makes the tokenizers also similar, so we should not need a separate tokenizer for both. Given that both these tokenizers just load the fastBPE tokenizer data format, we could simply call them fastBPETokenizer.

  2. Looking at this other code which also uses fastBPE, can't we just follow it to convert the fastBPE tokenizer files to the Hugging Face format (see the sketch after this list)?

    • You can easily convert your bpe.codes into a merges.txt file and then use the RoBERTa tokenizer.
    • The format is the same; you only need to drop the 3rd column in your bpe.codes and add a comment line at the top.
    • In your code you are not even using the last column values.
    • Your merges.txt can have the following as the first line: #version: 1 (look at the merges.txt file of RoBERTa).
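
A minimal sketch of that conversion (file names are illustrative; it simply drops the third, frequency column of bpe.codes and prepends the suggested header line):

# Hypothetical bpe.codes -> merges.txt conversion following the suggestion above.
with open("bpe.codes", encoding="utf-8") as codes, open("merges.txt", "w", encoding="utf-8") as merges:
    merges.write("#version: 1\n")  # top comment line, as suggested above
    for line in codes:
        pair = line.split()
        if len(pair) >= 2:
            merges.write(" ".join(pair[:2]) + "\n")  # keep the two merge tokens, drop the count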

@datquocnguyen
Contributor Author

datquocnguyen commented Sep 17, 2020

Hi @napsternxg, the model has already been uploaded to https://huggingface.co/vinai/bertweet-base. For now, you would have to install transformers from our development branch (as it has not been merged into the master branch of transformers yet). Did you try the following steps?

  • Python version >= 3.6
  • PyTorch version >= 1.4.0
  • Install transformers from our development branch:
    • git clone https://github.com/datquocnguyen/transformers.git
    • cd transformers
    • pip install --upgrade .
  • Install emoji: pip3 install emoji

Thanks for your suggestions. BertweetTokenizer is specifically designed to work on Tweet data, incorporating a Tweet tokenizer and normalizer, while PhobertTokenizer does not. Note that both our vocab.txt and bpe.codes files are also used when loading our models in fairseq, so I would prefer to keep them intact rather than converting them into another format.
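
For illustration, a rough sketch of that Tweet-specific preprocessing (hypothetical snippet; the normalization flag and the exact output are assumptions based on the diff and the example tweet shown elsewhere in this thread):

from transformers import BertweetTokenizer

tokenizer = BertweetTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)
# normalizeTweet maps user mentions, URLs and emoji to special tokens before BPE,
# e.g. producing something like "... HTTPURL via @USER :cry:".
print(tokenizer.normalizeTweet("SC has first two presumptive cases of coronavirus, DHEC confirms https://t.co/abc via @user 😢"))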

@datquocnguyen
Contributor Author

datquocnguyen commented Sep 17, 2020

Btw, I should mention that BERTweet has been accepted as an EMNLP-2020 demo paper, while PhoBERT has a slot in the Findings of EMNLP-2020 volume. Please help review this pull request so that others can benefit from using the models directly from the master branch of transformers. Thanks @LysandreJik @JetRunner @julien-c.
All checks have passed, and only the tokenizer files and their associated tests need to be reviewed.

@napsternxg

napsternxg commented Sep 17, 2020

Thanks, that makes sense.
@datquocnguyen I was trying to use it from the models website.
My suggestion on the bpe.codes file was not to remove it but to generate the merges.txt file from it, which would make it compatible with the Hugging Face tokenizer.

@datquocnguyen
Contributor Author

@napsternxg Please remove your "transformers" cache folder from ~/.cache/torch and reinstall transformers from our development branch. I am sure that BERTweet will work smoothly:

import torch
from transformers import AutoModel, AutoTokenizer

bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")

# INPUT TWEET IS ALREADY NORMALIZED!
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"

input_ids = torch.tensor([tokenizer.encode(line)])

with torch.no_grad():
    features = bertweet(input_ids)  # Models outputs are now tuples

@tienthanhdhcn

@datquocnguyen great work, and I am looking forward to seeing the PR get merged so that I can use the models directly from the Hugging Face transformers library.

Member

@LysandreJik LysandreJik left a comment

Ok I think this is great, I have nothing to add. LGTM, thanks for adding tests!

@LysandreJik
Member

Will merge today unless @julien-c, @JetRunner have comments.

@julien-c
Member

LGTM, do not hesitate to make the tokenizers as generic/configurable as possible, but this can be in a subsequent PR

@LysandreJik LysandreJik merged commit af2322c into huggingface:master Sep 18, 2020
fabiocapsouza pushed a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
* Add BERTweet and PhoBERT models

* Update modeling_auto.py

Re-add `bart` to LM_MAPPING

* Update tokenization_auto.py

Re-add `from .configuration_mobilebert import MobileBertConfig`
not sure why it's replaced by `from transformers.configuration_mobilebert import MobileBertConfig`

* Add BERTweet and PhoBERT to pretrained_models.rst

* Update tokenization_auto.py

Remove BertweetTokenizer and PhobertTokenizer from tokenization_auto.py (they are currently not supported by AutoTokenizer).

* Update BertweetTokenizer - without nltk

* Update model card for BERTweet

* PhoBERT - with Auto mode - without import fastBPE

* PhoBERT - with Auto mode - without import fastBPE

* BERTweet - with Auto mode - without import fastBPE

* Add PhoBERT and BERTweet to TF modeling auto

* Improve Docstrings for PhobertTokenizer and BertweetTokenizer

* Update PhoBERT and BERTweet model cards

* Fixed a merge conflict in tokenization_auto

* Used black to reformat BERTweet- and PhoBERT-related files

* Used isort to reformat BERTweet- and PhoBERT-related files

* Reformatted BERTweet- and PhoBERT-related files based on flake8

* Updated test files

* Updated test files

* Updated tf test files

* Updated tf test files

* Updated tf test files

* Updated tf test files

* Update commits from huggingface

* Delete unnecessary files

* Add tokenizers to auto and init files

* Add test files for tokenizers

* Revised model cards

* Update save_vocabulary function in BertweetTokenizer and PhobertTokenizer and test files

* Revised test files

* Update orders of Phobert and Bertweet tokenizers in auto tokenization file
fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
@thanhphi0401

Any news on this? When will PhoBERT be available on Hugging Face?

@LysandreJik
Member

It's been available since September:

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

model = AutoModelForMaskedLM.from_pretrained("vinai/phobert-base")

You can see the model card here.

@thanhphi0401

It's been available since September:

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

model = AutoModelForMaskedLM.from_pretrained("vinai/phobert-base")

You can see the model card here.

But I don't see it listed here: https://huggingface.co/transformers/pretrained_models.html
How can I integrate it with Rasa NLU, sir?
Thank you

@LysandreJik
Member

PhoBERT is based on the RoBERTa implementation, so you can load it into a RobertaForMaskedLM model. The tokenizer is custom, so you should load it through PhobertTokenizer.
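
For example, a minimal sketch of that combination (checkpoint name as used earlier in this thread):

from transformers import PhobertTokenizer, RobertaForMaskedLM

tokenizer = PhobertTokenizer.from_pretrained("vinai/phobert-base")
model = RobertaForMaskedLM.from_pretrained("vinai/phobert-base")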

I have never used Rasa NLU, so I can't help you much here. Your best option would be to open a thread on our forum with an example of how you do things for other models, so as not to flood this PR.

You can ping me on the thread (@Lysandre).

@datquocnguyen datquocnguyen mentioned this pull request May 4, 2022